SYS 6018 | Spring 2021 | University of Virginia
H. Diana McSpadden
When natural disasters or other emergencies compromise communications and transportation, it can be impossible to locate survivors within a reasonable time using human-only methods. This disaster relief toy problem seeks to discover whether an algorithm can effectively identify images corresponding to people who require relief.
To investigate whether this is possible, a data set was provided containing RGB color values corresponding to pixels. The pixels are from high-resolution images taken by an aircraft above Haiti after the 2010 earthquake. Blue tarps had been distributed to survivors, but rescue workers did not have information about where survivors relocated after receiving the blue tarps.
Blue tarps have a distinct color when compared to other identifiable elements in Haiti. The training data set includes 63,241 RGB pixel values from the toy problem’s images, along with a label for each pixel. The labels are: Blue Tarp, Rooftop, Soil, Various Non-Tarp, and Vegetation.
Below, the training data are explored and various modeling methods are applied to determine whether, and which, methods can effectively identify blue tarps from RGB values.
# Load Required Packages
library(tidyverse)
library(pROC)
library(randomForest)
library("GGally")
library(gridExtra)
library(plotly)
library(caret)
library(boot)
library(regclass)
library(MLeval)
library(ggplot2)
library(purrr)
library(broom)
filename = 'HaitiPixels.csv'
#url = 'https://collab.its.virginia.edu/access/lessonbuilder/item/1707832/group/17f014a1-d43d-4c78-a5c6-698a9643404f/Module3/HaitiPixels.csv' #this url is beng
haiti <- read_csv(filename)
print(dim(haiti))
#> [1] 63241 4
Below are the first 6 rows of the training dataset.
head(haiti)
#> # A tibble: 6 x 4
#> Class Red Green Blue
#> <chr> <dbl> <dbl> <dbl>
#> 1 Vegetation 64 67 50
#> 2 Vegetation 64 67 50
#> 3 Vegetation 64 66 49
#> 4 Vegetation 75 82 53
#> 5 Vegetation 74 82 54
#> 6 Vegetation 72 76 52
The dataframe contains 4 columns and 63,241 rows. The Class column contains the correct label for the observation. The Red, Green, and Blue parameters are each values between 0 and 255, as used in the additive RGB color model.
To prepare the data for exploratory data analysis I make Class a factor.
haiti %>%
  mutate(Class = factor(Class))
#> # A tibble: 63,241 x 4
#> Class Red Green Blue
#> <fct> <dbl> <dbl> <dbl>
#> 1 Vegetation 64 67 50
#> 2 Vegetation 64 67 50
#> 3 Vegetation 64 66 49
#> 4 Vegetation 75 82 53
#> 5 Vegetation 74 82 54
#> 6 Vegetation 72 76 52
#> 7 Vegetation 71 72 51
#> 8 Vegetation 69 70 49
#> 9 Vegetation 68 70 49
#> 10 Vegetation 67 70 50
#> # ... with 63,231 more rows
Examine the numbers and percentages in each of the 5 classes:
haiti %>%
group_by(Class) %>%
summarize(N = n()) %>%
  mutate(Perc = round(N / sum(N), 2) * 100)
#> # A tibble: 5 x 3
#> Class N Perc
#> * <chr> <int> <dbl>
#> 1 Blue Tarp 2022 3
#> 2 Rooftop 9903 16
#> 3 Soil 20566 33
#> 4 Various Non-Tarp 4744 8
#> 5 Vegetation 26006 41
The records are not evenly distributed across the categories. Of the classes, Blue Tarp, our “positive” category if we think of this as a binary positive/negative identification, is only 3% of the sample. Soil and Vegetation make up the majority of the sample at 74%.
It will be interesting to compare performance when predicting each of the five categories versus a binary is/is-not Blue Tarp.
After reviewing box plots for the 2-class data set, I also created two new calculated variables:
1. GBSqr = (Green + Blue)^2 * .001
2. RBSqr = (Red + Blue)^2 * .001
I created these to keep using the Red and Green values while increasing the difference in median value between the positive and negative classes. There is significant interplay between the Red, Green, and Blue values in identifying the correct shade of blue, so I want to continue using the Red and Green values but increase the linear separability between the classes. The 0.001 multiplier returns the new parameters to a scale similar to standard RGB values, i.e., a manual “scaling” of the new parameters.
haitiBinary = haiti %>%
mutate(ClassBinary = if_else(Class == 'Blue Tarp', '1', '0'), ClassBinary = factor(ClassBinary))
haitiBinarySqrs = haiti %>%
  mutate(GBSqr = I(((Green + Blue)^2) * .001), RBSqr = I(((Red + Blue)^2) * .001), ClassBinary = if_else(Class == 'Blue Tarp', '1', '0'), ClassBinary = factor(ClassBinary))
Re-examine the numbers and percentages in each of the 2 classes:
haitiBinary %>%
group_by(ClassBinary) %>%
summarize(N = n()) %>%
  mutate(Perc = round(N / sum(N), 2) * 100)
#> # A tibble: 2 x 3
#> ClassBinary N Perc
#> * <fct> <int> <dbl>
#> 1 0 61219 97
#> 2 1 2022 3
redplot <- ggplot(haiti, aes(x=Class, y=Red)) +
geom_boxplot(col='red')
greenplot <- ggplot(haiti, aes(x=Class, y=Green)) +
geom_boxplot(col='darkgreen')
blueplot <- ggplot(haiti, aes(x=Class, y=Blue)) +
geom_boxplot(col='darkblue')
grid.arrange(redplot, greenplot, blueplot)
redplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Red)) +
geom_boxplot(col='red')
greenplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Green)) +
geom_boxplot(col='darkgreen')
blueplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Blue)) +
geom_boxplot(col='darkblue')
grid.arrange(redplotB, greenplotB, blueplotB)
### How are red, blue and green values distributed between the 2 classes with the square values for Red + Blue and Green + Blue?
redplotB <- ggplot(haitiBinarySqrs, aes(x=ClassBinary, y=RBSqr)) +
geom_boxplot(col='red')
greenplotB <- ggplot(haitiBinarySqrs, aes(x=ClassBinary, y=GBSqr)) +
geom_boxplot(col='darkgreen')
blueplotB <- ggplot(haitiBinarySqrs, aes(x=ClassBinary, y=Blue)) +
geom_boxplot(col='darkblue')
grid.arrange(redplotB, greenplotB, blueplotB)
For the 5-class box plots:
I treat “Blue Tarp” as the “positive” result and the other classes as the “negative” result.
Regarding the box plots of the five categories, of interest is that “Soil” and “Vegetation” are relatively distinct in their RGB distributions, while “Rooftop” and “Various Non-Tarp” are more similar in their RGB distributions.
For the 2-class box plots:
If the classes are collapsed to binary values of “Blue Tarp (1)” and “Not Blue Tarp (0)” there is little overlap in the blue values for the two classes, and the ranges of red and green are much smaller for blue tarp than non-blue-tarp.
Generally, the values of red have a larger range for negative results than for positive results, and the positive results have a similar median to the negative results.
Green values have a larger range for negative results than for positive results, and the positive results have a higher median than the negative results.
There is almost no overlap in the blue values between blue tarps and non-blue tarps.
For the 2-class box plots with the additive square values:
If the classes are collapsed to binary values of “Blue Tarp (1)” and “Not Blue Tarp (0)” there is little overlap in the blue values for the two classes, and the RBSqr and GBSqr values have much less overlap than without the additive square variables.
The values of RBSqr have a larger range for negative results than for positive results, and the median is significantly greater for the positive results.
GBSqr values have a larger range for negative results than for positive results. The positive results have a significantly higher median than the negative results.
There is almost no overlap in the blue values between blue tarps and non-blue tarps.
These correlations make sense as the pixels were of highly saturated/“additive” colors. There are few pixels in the data set with low values for R,G,B.
#ggpairs(haiti, lower = list(continuous = "points", combo = "dot_no_facet"), progress = F)
ggpairs(haiti, progress = F)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#ggpairs(haiti, lower = list(continuous = "points", combo = "dot_no_facet"), progress = F)
ggpairs(haitiBinary[-1], progress = F)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggpairs(haitiBinarySqrs[-1], progress = F)
#> Warning: Computation failed in `stat_density()`:
#> attempt to apply non-function
#> Warning: Computation failed in `stat_density()`:
#> attempt to apply non-function
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Computation failed in `stat_bin()`:
#> attempt to apply non-function
#> Warning: Computation failed in `stat_bin()`:
#> attempt to apply non-function
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Computation failed in `stat_bin()`:
#> attempt to apply non-function
#> Warning: Computation failed in `stat_bin()`:
#> attempt to apply non-function
The RBSqr and GBSqr variables have significantly less variance in their values and better differentiation between the two classes than the Red and Green variables. I will use these transformed variables in my models.
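As a quick numeric check of this claim, here is a minimal sketch (using the haitiBinarySqrs data frame created above) comparing the class medians of the raw and transformed predictors:
# median of each predictor within each class; the gap between the "0" and "1"
# rows should be larger for GBSqr and RBSqr than for Red and Green
haitiBinarySqrs %>%
  group_by(ClassBinary) %>%
  summarize(across(c(Red, Green, Blue, GBSqr, RBSqr), ~ median(as.numeric(.x))))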
To view the relationship between the Red, Green, and Blue values between the five classes, and the binary classes, an interactive 3-D scatter plot is illustrative.
fiveCat3D = plot_ly(x=haiti$Red, y=haiti$Blue, z=haiti$Green, type="scatter3d", mode="markers", color=haiti$Class, colors = c('blue2','azure4','chocolate4','coral2','chartreuse4'),
marker = list(symbol = 'circle', sizemode = 'diameter', opacity =0.35))
fiveCat3D = fiveCat3D %>%
layout(title="5 Category RBG Plot", scene = list(xaxis = list(title = "Red", color="red"), yaxis = list(title = "Blue", color="blue"), zaxis = list(title = "Green", color="green")))
fiveCat3D
5-Class 3-D Scatter Plot Observations
One can see that there are discernible groupings of pixel categories by RGB values. Unsurprisingly, the blue tarps have higher blue values, but they do span a range of red and green values.
The 3D scatter plot is particularly useful because, by zooming in, one can see that while the ‘Blue Tarp’ values are generally distinct, there is a space in the 3D plot with mingling of “blue tarp” pixels and other pixel categories. That area of the data will provide a challenge for our model.
binary3D = plot_ly(x=haitiBinarySqrs$RBSqr, y=haitiBinarySqrs$Blue, z=haitiBinarySqrs$GBSqr, type="scatter3d", mode="markers", color=haitiBinary$ClassBinary, colors = c('red','blue2'),
marker = list(symbol = 'circle', sizemode = 'diameter', opacity =0.35))
binary3D = binary3D %>%
layout(title="Binary RBG Plot", scene = list(xaxis = list(title = "RBSqr", color="red"), yaxis = list(title = "Blue", color="blue"), zaxis = list(title = "GBSqr", color="green")))
binary3D
2-Class 3-D Scatter Plot Observations With Blue, GBSqr, and RBSqr
Similar to the five-category 3D scatter plot, the binary scatter plot shows distinct groupings for blue tarp and non-blue-tarp. Visually, there is a nearly distinct linear boundary between the blue tarp and non-blue tarp observations.
library(caret)
library(boot)
For logistic regression, LDA, QDA, KNN, and penalized logistic regression, cross-validation threshold performance is tuned using ROC and Accuracy.
The following performance measures are collected for both the 10-fold cross-validation and the hold-out/testing/validation data: AUROC, accuracy, TPR, FPR, and precision.
For the models:
* No: not a Blue Tarp (negative)
* Yes: is a Blue Tarp (positive)
To work with caret’s class-probability machinery, which requires valid R names for the factor levels, I set ClassBinary to a factor with levels ordered “No”, “Yes”.
I also review the factor counts and create a dataframe named “train”.
levels(haitiBinarySqrs$ClassBinary)
#> [1] "0" "1"
levels(haitiBinarySqrs$ClassBinary)=c("No","Yes")
levels(haitiBinarySqrs$ClassBinary)
#> [1] "No" "Yes"
fct_count(haitiBinarySqrs$ClassBinary)
#> # A tibble: 2 x 2
#> f n
#> <fct> <int>
#> 1 No 61219
#> 2 Yes 2022
train = haitiBinarySqrs
Using 10-fold cross-validation and threshold values p in the collection (.1, .2, .3, .35, .4, .45, .5, .55, .6, .65, .7, .75, .8, .85, .9, .95, 1.0), I tested three models:
**Model 1: Greatest Complexity:** \[ClassBinary = Blue + Green + Red + GBSqr + RBSqr \hspace{.3cm} | \hspace{.3cm} GBSqr = (Green + Blue)^2 \hspace{.3cm} and \hspace{.3cm} RBSqr = (Red + Blue)^2\]
**Model 2: Standard Additive Model:** \[ClassBinary = Blue + Green + Red\]
**Model 3: Simple Logistic Regression Model:** \[ClassBinary = Blue\]
library(yardstick)
#> For binary classification, the first factor level is assumed to be the event.
#> Use the argument `event_level = "second"` to alter this as needed.
#>
#> Attaching package: 'yardstick'
#> The following objects are masked from 'package:caret':
#>
#> precision, recall, sensitivity, specificity
#> The following object is masked from 'package:readr':
#>
#> spec
set.seed(1976)
#Randomly shuffle the data
haitiBinarySqrsShuffled = haitiBinarySqrs[sample(nrow(haitiBinarySqrs)),]
#Create 10 equally size folds
folds <- cut(seq(1,nrow(haitiBinarySqrsShuffled)),breaks=10,labels=FALSE)
# create storage variables for the p value, ROC, Specificity, and Sensitivity
k_start = 1
k_end = 17
out_nvar = k_end - k_start + 1
p_i = rep(NA, out_nvar)
# complex model measures
sensitivity_c_i = rep(NA, out_nvar)
specificity_c_i = rep(NA, out_nvar)
prec_c_i = rep(NA, out_nvar)
# additive model measures
sensitivity_a_i = rep(NA, out_nvar)
specificity_a_i = rep(NA, out_nvar)
prec_a_i = rep(NA, out_nvar)
# simple model measures
sensitivity_s_i = rep(NA, out_nvar)
specificity_s_i = rep(NA, out_nvar)
prec_s_i = rep(NA, out_nvar)
counter = 1
for (j in c(.1,.2,.3,.35,.4,.45, .5,.55,.6,.65,.7,.75,.8,.85,.9,.95,1.0))
{
p_i[counter] = j
accumulator_c_sens = 0
accumulator_c_spec = 0
accumulator_c_prec = 0
accumulator_a_sens = 0
accumulator_a_spec = 0
accumulator_a_prec = 0
accumulator_s_sens = 0
accumulator_s_spec = 0
accumulator_s_prec = 0
#Perform 10 fold cross validation
for(i in 1:10) {
#Segment your data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiBinarySqrsShuffled[testIndexes, ]
trainData = haitiBinarySqrsShuffled[-testIndexes, ]
# define complex model
glm.fits.complex = glm(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, binomial, data = trainData)
# define additive model
glm.fits.additive = glm(ClassBinary ~ Blue+Green+Red, binomial, data = trainData)
# define simple model
glm.fits.simple = glm(ClassBinary ~ Blue, binomial, data = trainData)
# fit the complex model
glm.pred.complex = glm.fits.complex %>% augment(newdata=testData) %>%
dplyr::select(ClassBinary, .fitted) %>%
mutate(ClassBinary=factor(ClassBinary))%>%
mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < j, "No", "Yes")) %>%
mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes")))
if (nlevels(glm.pred.complex$.prediction) > 1)
{
accumulator_c_sens = accumulator_c_sens + yardstick::sens(glm.pred.complex, ClassBinary, .prediction)[[3]]
accumulator_c_spec = accumulator_c_spec + yardstick::spec(glm.pred.complex, ClassBinary, .prediction)[[3]]
accumulator_c_prec = accumulator_c_prec + yardstick::precision(glm.pred.complex, ClassBinary, .prediction)[[3]]
}
# fit the additive model
glm.pred.additive = glm.fits.additive %>% augment(newdata=testData) %>%
dplyr::select(ClassBinary, .fitted) %>%
mutate(ClassBinary=factor(ClassBinary))%>%
mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < j, "No", "Yes")) %>%
mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes")))
if (nlevels(glm.pred.additive$.prediction) > 1)
{
accumulator_a_sens = accumulator_a_sens + yardstick::sens(glm.pred.additive, ClassBinary, .prediction)[[3]]
accumulator_a_spec = accumulator_a_spec + yardstick::spec(glm.pred.additive, ClassBinary, .prediction)[[3]]
accumulator_a_prec = accumulator_a_prec + yardstick::precision(glm.pred.additive, ClassBinary, .prediction)[[3]]
}
# fit the simple model
glm.pred.simple= glm.fits.simple %>% augment(newdata=testData) %>%
dplyr::select(ClassBinary, .fitted) %>%
mutate(ClassBinary=factor(ClassBinary))%>%
mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < j, "No", "Yes")) %>%
mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes")))
if (nlevels(glm.pred.simple$.prediction) > 1)
{
accumulator_s_sens = accumulator_s_sens + yardstick::sens(glm.pred.simple, ClassBinary, .prediction)[[3]]
accumulator_s_spec = accumulator_s_spec + yardstick::spec(glm.pred.simple, ClassBinary, .prediction)[[3]]
accumulator_s_prec = accumulator_s_prec + yardstick::precision(glm.pred.simple, ClassBinary, .prediction)[[3]]
}
}
sensitivity_c_i[counter] = accumulator_c_sens / 10
specificity_c_i[counter] = accumulator_c_spec / 10
prec_c_i[counter] = accumulator_c_prec /10
sensitivity_a_i[counter] = accumulator_a_sens / 10
specificity_a_i[counter] = accumulator_a_spec / 10
prec_a_i[counter] = accumulator_a_prec /10
sensitivity_s_i[counter] = accumulator_s_sens / 10
specificity_s_i[counter] = accumulator_s_spec / 10
prec_s_i[counter] = accumulator_s_prec /10
counter = counter + 1
}
outcome = data.frame(p_i, sensitivity_c_i, specificity_c_i, prec_c_i, sensitivity_a_i, specificity_a_i, prec_a_i, sensitivity_s_i, specificity_s_i, prec_s_i)
print(outcome)
Of the three models, Model 1, the model with the square terms, performed better on all measures than the standard additive Model 2 and the simple Model 3 (ClassBinary = Blue).
Because the geographic distribution of blue tarps is not included in this data, I will focus on precision and specificity, i.e., the model that identifies the most blue tarps. A threshold of p = 0.1 identified the most blue tarps and had the highest precision under 10-fold cross-validation.
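To visualize the trade-off behind this choice, here is a minimal sketch (assuming the outcome data frame produced by the threshold loop above) plotting the complex model's cross-validated measures against the threshold:
# reshape the complex-model columns and plot each measure against the threshold
outcome %>%
  dplyr::select(p_i, sensitivity_c_i, specificity_c_i, prec_c_i) %>%
  pivot_longer(-p_i, names_to = "measure", values_to = "value") %>%
  ggplot(aes(x = p_i, y = value, color = measure)) +
  geom_line() + geom_point() +
  labs(x = "threshold p", y = "10-fold CV average")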
Below I run the complex model with p = 0.1 over the entire training data set, via 10-fold cross-validation, to get the training Accuracy, TPR, FPR, and Precision:
set.seed(1976)
#Randomly shuffle the data
haitiBinarySqrsShuffled = haitiBinarySqrs[sample(nrow(haitiBinarySqrs)),]
#Create 10 equally size folds
folds <- cut(seq(1,nrow(haitiBinarySqrsShuffled)),breaks=10,labels=FALSE)
out_lr.complex = tibble()
#Perform 10 fold cross validation
for(i in 1:10) {
#Segment your data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiBinarySqrsShuffled[testIndexes, ]
trainData = haitiBinarySqrsShuffled[-testIndexes, ]
# define complex model
glm.fits.complex = glm(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, binomial, data = trainData)
# fit the complex model
glm.pred.complex = glm.fits.complex %>% augment(newdata=testData) %>%
dplyr::select(ClassBinary, .fitted) %>%
mutate(ClassBinary=factor(ClassBinary))%>%
mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < .1, "No", "Yes")) %>%
mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes")))
out_lr.complex = bind_rows(out_lr.complex,
tibble(truth = testData$ClassBinary,
prediction = glm.pred.complex$.prediction,
fitted = glm.pred.complex$.fitted))
}
caret::confusionMatrix(out_lr.complex$prediction, out_lr.complex$truth)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 60987 41
#> Yes 232 1981
#>
#> Accuracy : 0.9957
#> 95% CI : (0.9951, 0.9962)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9333
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.9962
#> Specificity : 0.9797
#> Pos Pred Value : 0.9993
#> Neg Pred Value : 0.8952
#> Prevalence : 0.9680
#> Detection Rate : 0.9644
#> Detection Prevalence : 0.9650
#> Balanced Accuracy : 0.9880
#>
#> 'Positive' Class : No
#>
For Logistic Regression, my calculations for Accuracy, TPR, FPR, and Precision treat ‘Yes’, i.e. “yes, it is a blue tarp”, as the positive class.
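Here is a minimal sketch of those calculations (using the cross-validated predictions collected in out_lr.complex above); caret::confusionMatrix accepts a positive argument, and the same quantities can be computed by hand from the 2x2 table:
# confusion matrix with "Yes" (blue tarp) as the positive class
caret::confusionMatrix(out_lr.complex$prediction, out_lr.complex$truth, positive = "Yes")
# the same measures computed by hand from the 2x2 table
tab = table(Prediction = out_lr.complex$prediction, Truth = out_lr.complex$truth)
TP = tab["Yes", "Yes"]; FP = tab["Yes", "No"]
FN = tab["No", "Yes"];  TN = tab["No", "No"]
c(Accuracy  = (TP + TN) / sum(tab),
  TPR       = TP / (TP + FN),   # sensitivity / recall for "Yes"
  FPR       = FP / (FP + TN),
  Precision = TP / (TP + FP))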
The following cross-validation measures were calculated with a threshold, p, of 0.1:
library(ROCR)
##produce the numbers associated with classification table
rates = prediction(out_lr.complex$fitted, out_lr.complex$truth)
##store the true positive and false positive rates
roc_result = performance(rates, measure="tpr", x.measure="fpr")
##plot ROC curve and overlay the diagonal line for random guessing
plot(roc_result, main="ROC Curve for Haiti Complex Logistic Regression")
lines(x = c(0,1), y = c(0,1), col="red")
##compute the AUC
auc = performance(rates, measure = "auc")
auc@y.values[[1]]
#> [1] 0.9995643
The Logistic Regression ROC-AUC for the 10-fold cross-validated training data with p = 0.1 is 0.9996.
I trained the LDA model using 10-fold cross-validation, with p values from 0.05 to 1 at intervals of 0.05. Tuning was performed using ROC.
set.seed(1976)
# create storage variables for the p value, ROC, Specificity, and Sensitivity
k_start = 1
k_end = 20
out_nvar = k_end - k_start + 1
p_i = rep(NA, out_nvar)
sensitivity_i = rep(NA, out_nvar)
specificity_i = rep(NA, out_nvar)
ROC_i = rep(NA, out_nvar)
for (i in 1:20)
{
p_i[i] = i/20
trctrl <- trainControl(method = "repeatedcv", summaryFunction=twoClassSummary, classProbs=T, savePredictions = T, p = i/20)
lda.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "lda", trControl=trctrl, tuneLength = 10)
sensitivity_i[i] = lda.cv.model$results[['Sens']]
specificity_i[i] = lda.cv.model$results[['Spec']]
ROC_i[i] = lda.cv.model$results[['ROC']]
}
outcome = data.frame(p_i, sensitivity_i, specificity_i, ROC_i)
outcome %>% filter(sensitivity_i == max(sensitivity_i))
#> p_i sensitivity_i specificity_i ROC_i
#> 1 0.55 0.9992813 0.8471833 0.9944974
outcome %>% filter(specificity_i == max(specificity_i))
#> p_i sensitivity_i specificity_i ROC_i
#> 1 1 0.9992486 0.8476564 0.9944919
outcome %>% filter(ROC_i == max(ROC_i))
#> p_i sensitivity_i specificity_i ROC_i
#> 1 0.8 0.9992649 0.8466834 0.9945253
For LDA, p-values of 0.55, 0.8, and 1.0 are interesting.
All the values are similar, and any of these three could be a valid choice. The decision depends on weighing the risk of missing a true blue tarp, which could cost human lives by failing to identify the location of survivors, against the risk of wasting aid workers' time by sending them to an incorrectly predicted blue tarp location.
I am selecting the value with the greatest ROC, p = 0.80.
trctrl.lda.selected <- trainControl(method = "repeatedcv", summaryFunction=twoClassSummary, classProbs=T, savePredictions = T, p = 0.8)
#trctrl.lda.selected <- trainControl(method = "repeatedcv", summaryFunction=twoClassSummary, classProbs=T, savePredictions = T, p = 1.0)
lda.cv.model.selected = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "lda", trControl=trctrl.lda.selected, tuneLength = 10)
#> Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
#> in the result set. ROC will be used instead.
caret::confusionMatrix(lda.cv.model.selected)
#> Cross-Validated (10 fold, repeated 1 times) Confusion Matrix
#>
#> (entries are percentual average cell counts across resamples)
#>
#> Reference
#> Prediction No Yes
#> No 96.7 0.5
#> Yes 0.1 2.7
#>
#> Accuracy (average) : 0.9944
For LDA, my calculations for TPR, FPR, and Precision treat ‘Yes’, i.e. “yes, it is a blue tarp”, as the positive class.
result.lda = evalm(lda.cv.model.selected)
#> ***MLeval: Machine Learning Model Evaluation***
#> Input: caret train function object
#> Averaging probs.
#> Group 1 type: repeatedcv
#> Observations: 63241
#> Number of groups: 1
#> Observations per group: 63241
#> Positive: Yes
#> Negative: No
#> Group: Group 1
#> Positive: 2022
#> Negative: 61219
#> ***Performance Metrics***
#> Group 1 Optimal Informedness = 0.907667748075535
#> Group 1 AUC-ROC = 0.99
result.lda$roc
The LDA ROC-AUC for the 10-fold cross-validated training data with p=0.8 is: 0.99.
From the curves provided by the MLeval library, one can see that precision (positive predictive value) drops off quickly as sensitivity/recall increases.
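For a less rounded AUC than MLeval reports, here is a minimal sketch (assuming the resample predictions that caret saved in lda.cv.model.selected$pred) using the pROC package loaded above:
# ROC computed from the held-out fold predictions saved by caret
lda.roc = pROC::roc(response  = lda.cv.model.selected$pred$obs,
                    predictor = lda.cv.model.selected$pred$Yes,
                    levels = c("No", "Yes"))
pROC::auc(lda.roc)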
I trained the QDA model using 10-fold cross-validation, with p values from 0.05 to 1 at intervals of 0.05. Tuning was performed using ROC.
set.seed(1976)
# create storage variables for the p value, ROC, Specificity, and Sensitivity
k_start = 1
k_end = 20
out_nvar = k_end - k_start + 1
p_i = rep(NA, out_nvar)
sensitivity_i = rep(NA, out_nvar)
specificity_i = rep(NA, out_nvar)
ROC_i = rep(NA, out_nvar)
for (i in 1:20)
{
p_i[i] = i/20
trctrl <- trainControl(method = "repeatedcv", summaryFunction=twoClassSummary, classProbs=T, savePredictions = T, p = i/20)
qda.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "qda", trControl=trctrl, tuneLength = 10)
sensitivity_i[i] = qda.cv.model$results[['Sens']]
specificity_i[i] = qda.cv.model$results[['Spec']]
ROC_i[i] = qda.cv.model$results[['ROC']]
}
outcome = data.frame(p_i,sensitivity_i, specificity_i, ROC_i)
outcome
#> p_i sensitivity_i specificity_i ROC_i
#> 1 0.05 0.9977785 0.8946691 0.9973744
#> 2 0.10 0.9977785 0.8936814 0.9974080
#> 3 0.15 0.9977621 0.8936839 0.9973791
#> 4 0.20 0.9977785 0.8936619 0.9973530
#> 5 0.25 0.9977785 0.8941594 0.9973198
#> 6 0.30 0.9977785 0.8931766 0.9973839
#> 7 0.35 0.9977785 0.8931693 0.9973509
#> 8 0.40 0.9977785 0.8941789 0.9973242
#> 9 0.45 0.9977948 0.8941496 0.9973749
#> 10 0.50 0.9977948 0.8926572 0.9973817
#> 11 0.55 0.9977785 0.8941691 0.9973978
#> 12 0.60 0.9977621 0.8941521 0.9973767
#> 13 0.65 0.9977785 0.8941643 0.9974098
#> 14 0.70 0.9977621 0.8942033 0.9974072
#> 15 0.75 0.9977785 0.8946422 0.9973973
#> 16 0.80 0.9978111 0.8936595 0.9973844
#> 17 0.85 0.9977948 0.8951617 0.9974251
#> 18 0.90 0.9978111 0.8941765 0.9973929
#> 19 0.95 0.9977621 0.8936863 0.9974052
#> 20 1.00 0.9978111 0.8941374 0.9973825
For QDA, several p values perform very similarly. I will select p = 0.85 for the QDA model, as it has nearly the greatest ROC and the greatest specificity.
trctrl.qda.selected <- trainControl(method = "repeatedcv", summaryFunction=twoClassSummary, classProbs=T, savePredictions = T, p = 0.85)
qda.cv.model.selected = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "qda", trControl=trctrl.qda.selected, tuneLength = 10)
#> Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
#> in the result set. ROC will be used instead.
caret::confusionMatrix(qda.cv.model.selected)
#> Cross-Validated (10 fold, repeated 1 times) Confusion Matrix
#>
#> (entries are percentual average cell counts across resamples)
#>
#> Reference
#> Prediction No Yes
#> No 96.6 0.3
#> Yes 0.2 2.9
#>
#> Accuracy (average) : 0.9944
For QDA, my calculations for TPR, FPR, and Precision treat ‘Yes’, i.e. “yes, it is a blue tarp”, as the positive class.
result.qda = evalm(qda.cv.model.selected)
#> ***MLeval: Machine Learning Model Evaluation***
#> Input: caret train function object
#> Averaging probs.
#> Group 1 type: repeatedcv
#> Observations: 63241
#> Number of groups: 1
#> Observations per group: 63241
#> Positive: Yes
#> Negative: No
#> Group: Group 1
#> Positive: 2022
#> Negative: 61219
#> ***Performance Metrics***
#> Group 1 Optimal Informedness = 0.945173219869338
#> Group 1 AUC-ROC = 1
result.qda$roc
The QDA ROC-AUC for the 10-fold cross-validated training data with p = 0.85 is 1.0.
set.seed(1976)
trControl.knn <- trainControl(method = "repeatedcv", summaryFunction=twoClassSummary, classProbs=T, savePredictions = T, p = 0.5)
knn.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "knn", trControl=trControl.knn, tuneGrid = expand.grid(k = 1:21))
knn.cv.model
#> k-Nearest Neighbors
#>
#> 63241 samples
#> 5 predictor
#> 2 classes: 'No', 'Yes'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 1 times)
#> Summary of sample sizes: 56916, 56917, 56917, 56917, 56917, 56917, ...
#> Resampling results across tuning parameters:
#>
#> k ROC Sens Spec
#> 1 0.9763746 0.9985135 0.9431400
#> 2 0.9897653 0.9983665 0.9470858
#> 3 0.9945602 0.9984482 0.9549968
#> 4 0.9968378 0.9983992 0.9545140
#> 5 0.9976442 0.9984645 0.9574770
#> 6 0.9981688 0.9983502 0.9559942
#> 7 0.9984338 0.9983175 0.9609374
#> 8 0.9986851 0.9983829 0.9574721
#> 9 0.9989355 0.9982358 0.9574818
#> 10 0.9992060 0.9983339 0.9535214
#> 11 0.9992128 0.9983829 0.9569770
#> 12 0.9994889 0.9983992 0.9554943
#> 13 0.9994816 0.9983992 0.9564795
#> 14 0.9994959 0.9984319 0.9574672
#> 15 0.9994967 0.9984645 0.9554919
#> 16 0.9994996 0.9983665 0.9549895
#> 17 0.9995099 0.9984155 0.9545018
#> 18 0.9995096 0.9983992 0.9544993
#> 19 0.9995057 0.9984319 0.9530191
#> 20 0.9995045 0.9983829 0.9525289
#> 21 0.9994990 0.9983502 0.9530191
#>
#> ROC was used to select the optimal model using the largest value.
#> The final value used for the model was k = 17.
set.seed(1976)
knn.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "knn", trControl=trControl.knn, tuneGrid = expand.grid(k = 17:35))
knn.cv.model
#> k-Nearest Neighbors
#>
#> 63241 samples
#> 5 predictor
#> 2 classes: 'No', 'Yes'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 1 times)
#> Summary of sample sizes: 56916, 56917, 56917, 56917, 56917, 56917, ...
#> Resampling results across tuning parameters:
#>
#> k ROC Sens Spec
#> 17 0.9995099 0.9984155 0.9549968
#> 18 0.9995096 0.9984155 0.9535117
#> 19 0.9995057 0.9984319 0.9530191
#> 20 0.9995045 0.9983992 0.9515315
#> 21 0.9994990 0.9983502 0.9530191
#> 22 0.9995005 0.9983665 0.9535141
#> 23 0.9994977 0.9983338 0.9545018
#> 24 0.9994941 0.9983502 0.9545042
#> 25 0.9994904 0.9983175 0.9545018
#> 26 0.9994910 0.9983829 0.9549968
#> 27 0.9997324 0.9983665 0.9549968
#> 28 0.9997300 0.9983992 0.9540140
#> 29 0.9997288 0.9983665 0.9535141
#> 30 0.9997258 0.9983665 0.9530215
#> 31 0.9997237 0.9983829 0.9545042
#> 32 0.9997222 0.9983338 0.9530191
#> 33 0.9997161 0.9983338 0.9530215
#> 34 0.9997163 0.9983012 0.9525265
#> 35 0.9997147 0.9982848 0.9530215
#>
#> ROC was used to select the optimal model using the largest value.
#> The final value used for the model was k = 27.
In order to speed up this Rmd file, I am no longer evaluating the following code chunk, which evaluated k = 27 to 51. When that chunk is run, the best k is still 27.
set.seed(1976)
knn.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "knn", trControl=trControl.knn, tuneGrid = expand.grid(k = 27:51))
knn.cv.model
From k = 1 to 51, the best k is 27. The tables of ROC, Sensitivity, and Specificity were reviewed for each cross-validation run. From these tables one can see that the improvements within this range are in the hundredths of a percentage point of ROC, so values of k from roughly 10 to 51 appear to be reasonable selections for the cross-validated training data.
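Here is a minimal sketch (using the knn.cv.model object from the last evaluated chunk) that plots the cross-validated ROC against k, which makes the flatness of the curve easy to see:
# cross-validated ROC as a function of k for the evaluated grid
ggplot(knn.cv.model$results, aes(x = k, y = ROC)) +
  geom_line() + geom_point() +
  labs(title = "10-fold CV ROC by number of neighbors (k)")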
Running manual KNN cross-validation to tune the threshold p took significant time on my laptop, so I ran with smaller lists of p values just to get a feel for the results. I had previously run with p = 0.5 and obtained an accuracy of 0.996.
For processing speed, I commented out the extended loop of threshold values and limited the list to 0.4, 0.5, and 0.6 for illustrative purposes.
library(class)
out_knn_p = tibble()
haitiBinarySqrsShuffled = haitiBinarySqrsShuffled %>% mutate(ClassBinaryTF = factor(if_else(ClassBinary == "No", FALSE, TRUE)))
#for (p in c(.1,.2,.3,.7,.8,.9))
#for (p in c(.2,.3,.4,.5,.6,.7,.8))
for (p in c(.4,.5,.6))
{
for (j in 1:10)
{
#Segment your data by fold using the which() function
testIndexes = which(folds==j,arr.ind=TRUE)
testData = haitiBinarySqrsShuffled[testIndexes, ]
trainData = haitiBinarySqrsShuffled[-testIndexes, ]
train_knn <- train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = trainData, method = "knn", tuneGrid = data.frame(k=27))
# test with the fold's test data
preds = predict(train_knn, newdata = testData, type="prob")
#- evaluate fold k
y_true = testData$ClassBinaryTF
# set the threshold to p
thres = p
y_hat = tibble(preds$Yes > thres)
acc = apply(y_hat, 2, function(y) mean(y == y_true))
#- output
out_knn_p = bind_rows(out_knn_p, tibble(threshold = p, accuracy = acc))
}
}
out_knn_p
#> # A tibble: 30 x 2
#> threshold accuracy
#> <dbl> <dbl>
#> 1 0.4 0.997
#> 2 0.4 0.996
#> 3 0.4 0.998
#> 4 0.4 0.995
#> 5 0.4 0.998
#> 6 0.4 0.997
#> 7 0.4 0.996
#> 8 0.4 0.997
#> 9 0.4 0.997
#> 10 0.4 0.997
#> # ... with 20 more rows
#-- Get mean accuracy
perf_knn_p = out_knn_p %>%
group_by(threshold) %>%
summarize(mean_acc = mean(accuracy))
perf_knn_p %>% arrange(-mean_acc)
#> # A tibble: 3 x 2
#> threshold mean_acc
#> <dbl> <dbl>
#> 1 0.5 0.997
#> 2 0.6 0.997
#> 3 0.4 0.997
After testing the following values of p: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, the value p = 0.5 produced the highest mean accuracy. This is convenient because caret's cross-validation for KNN models defaults to a 0.5 threshold.
caret::confusionMatrix(knn.cv.model)
#> Cross-Validated (10 fold, repeated 1 times) Confusion Matrix
#>
#> (entries are percentual average cell counts across resamples)
#>
#> Reference
#> Prediction No Yes
#> No 96.6 0.1
#> Yes 0.2 3.1
#>
#> Accuracy (average) : 0.997
result.knn = evalm(knn.cv.model)
#> ***MLeval: Machine Learning Model Evaluation***
#> Input: caret train function object
#> Averaging probs.
#> Group 1 type: repeatedcv
#> Observations: 63241
#> Number of groups: 1
#> Observations per group: 63241
#> Positive: Yes
#> Negative: No
#> Group: Group 1
#> Positive: 2022
#> Negative: 61219
#> ***Performance Metrics***
#> Group 1 Optimal Informedness = 0.98900668093239
#> Group 1 AUC-ROC = 1
result.knn$roc
Subset selection using glmnet/ElasticNet provides an opportunity to see whether additional predictors are useful for improving the precision of the model.
I added additional terms to the model. I selected additional additive relations between the colors because RGB color is, by design, an additive color model. The interplay between Red, Blue, and Green is intuitively significant because combinations of these values create the visible spectrum of color.
The predictors in the ElasticNet model are: Red, Green, Blue, GBSqr, RBSqr, RGSqr, and RGBSqr.
haitiBinaryFull = haitiBinarySqrs %>%
  mutate(RGSqr = I(((Red + Green)^2)), RGBSqr = I(((Red + Green + Blue)^2)))
head(haitiBinaryFull)
#> # A tibble: 6 x 9
#> Class Red Green Blue GBSqr RBSqr ClassBinary RGSqr RGBSqr
#> <chr> <dbl> <dbl> <dbl> <I<dbl>> <I<dbl>> <fct> <I<dbl>> <I<dbl>>
#> 1 Vegetation 64 67 50 13.7 13.0 No 17161 32761
#> 2 Vegetation 64 67 50 13.7 13.0 No 17161 32761
#> 3 Vegetation 64 66 49 13.2 12.8 No 16900 32041
#> 4 Vegetation 75 82 53 18.2 16.4 No 24649 44100
#> 5 Vegetation 74 82 54 18.5 16.4 No 24336 44100
#> 6 Vegetation 72 76 52 16.4 15.4 No 21904 40000
I did not scale the predictors because the glmnet library will scale to standard deviation units. For the glmnet library, the training data must be in matrix format:
frmla = as.formula(ClassBinary~Red+Green+Blue+GBSqr+RBSqr+RGSqr+RGBSqr)
x.haitiFull.train = model.matrix(frmla, data = haitiBinaryFull)[,-1] # removing the intercept term from the formula
# switch ClassBinary to TRUE FALSE to facilitate the next bit of code
y.haitiFull.train = (haitiBinaryFull %>% mutate(ClassBinary = ifelse(ClassBinary == 'No', FALSE, TRUE)))$ClassBinary
Using penalized logistic regression (PLR), I evaluated three different PLR methods: ridge, lasso, and elasticnet (a combination of ridge and lasso). I tuned both lambda and the probability threshold (p). To tune the threshold, I considered p values of .1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95.
For the sake of processing speed I commented out the loop for the extended p values (0.2 was the best threshold when I considered the extended list), and kept 0.1, 0.2, 0.3 for illustrative purposes.
library(glmnet)
set.seed(1976)
# number of folds
K = 10
# make folds
folds = rep(1:K, length=nrow(x.haitiFull.train))
out = tibble()
# LOOP FOR alpha: 0 (ridge), 0.5 (elasticnet), and 1 (lasso)
for (a in c(0, .5, 1)){
# lambda may be different for the different PLR methods, so this is decided within the alpha loop
#-- Get lambda sequence so consistent over all folds
lam_seq = glmnet(x.haitiFull.train, y.haitiFull.train, family="binomial", alpha=a)$lambda
# TO DO: ADD LOOP OVER p values of .1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95
#for (p in c(.1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95)){
for (p in c(.1, .2, .3)){
#-- Loop over K folds
for(k in unique(folds)){
#- Get train/test split for fold k
ind.train = which(folds != k)
ind.test = which(folds == k)
#- fit the alpha model over all lambdas in lam_seq
fit.model = glmnet(x.haitiFull.train[ind.train,], y.haitiFull.train[ind.train],
alpha = a, # use the alpha for the loop
family="binomial", # logistic regression
lambda = lam_seq) # all models in this alpha loop use same lambda sequence
#- make predictions on test set fold k
pred = predict(fit.model, x.haitiFull.train[ind.test, ], s=lam_seq, type = "response")
#- evaluate fold k
y_true = y.haitiFull.train[ind.test]
# set the threshold to p
thres = p
y_hat = pred > thres
acc = apply(y_hat, 2, function(y) mean(y == y_true))
#- output
out = bind_rows(out,
tibble(accuracy = acc,
lambda = lam_seq,
n_train = length(ind.train),
n_test = length(ind.test),
alpha = a,
thres = p,
k = k))
}
}
}
out %>% filter(accuracy == max(accuracy))
#> # A tibble: 17 x 7
#> accuracy lambda n_train n_test alpha thres k
#> <dbl> <dbl> <int> <int> <dbl> <dbl> <int>
#> 1 0.997 0.0000369 56917 6324 0.5 0.2 10
#> 2 0.997 0.0000369 56917 6324 0.5 0.3 6
#> 3 0.997 0.0000336 56917 6324 0.5 0.3 6
#> 4 0.997 0.0000306 56917 6324 0.5 0.3 6
#> 5 0.997 0.0000279 56917 6324 0.5 0.3 6
#> 6 0.997 0.0000254 56917 6324 0.5 0.3 6
#> 7 0.997 0.0000232 56917 6324 0.5 0.3 6
#> 8 0.997 0.0000139 56917 6324 1 0.1 4
#> 9 0.997 0.0000127 56917 6324 1 0.1 4
#> 10 0.997 0.0000563 56917 6324 1 0.2 4
#> 11 0.997 0.0000513 56917 6324 1 0.2 4
#> 12 0.997 0.0000222 56917 6324 1 0.3 10
#> 13 0.997 0.0000202 56917 6324 1 0.3 10
#> 14 0.997 0.0000184 56917 6324 1 0.3 10
#> 15 0.997 0.0000168 56917 6324 1 0.3 10
#> 16 0.997 0.0000153 56917 6324 1 0.3 10
#> 17 0.997 0.0000139 56917 6324 1 0.3 10
#-- Get mean accuracy
perf = out %>%
group_by(alpha, thres, lambda) %>%
  summarize(mean_acc = mean(accuracy), se_acc = sd(accuracy)/k)
#> `summarise()` has grouped output by 'alpha', 'thres', 'lambda'. You can override using the `.groups` argument.
perf %>% arrange(-mean_acc)
#> # A tibble: 8,640 x 5
#> # Groups: alpha, thres, lambda [864]
#> alpha thres lambda mean_acc se_acc
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.2 0.0000563 0.996 0.000467
#> 2 1 0.2 0.0000563 0.996 0.000233
#> 3 1 0.2 0.0000563 0.996 0.000156
#> 4 1 0.2 0.0000563 0.996 0.000117
#> 5 1 0.2 0.0000563 0.996 0.0000933
#> 6 1 0.2 0.0000563 0.996 0.0000778
#> 7 1 0.2 0.0000563 0.996 0.0000667
#> 8 1 0.2 0.0000563 0.996 0.0000583
#> 9 1 0.2 0.0000563 0.996 0.0000519
#> 10 1 0.2 0.0000563 0.996 0.0000467
#> # ... with 8,630 more rows
log(5.630136e-05)
#> [1] -9.784792
The plot below charts the change in mean accuracy as the log of lambda increases across the 10 folds. The red error bars span (mean accuracy - standard error of accuracy) to (mean accuracy + standard error of accuracy), giving the range of mean accuracy at each log lambda. The plot demonstrates that accuracy generally decreases, and the standard error of accuracy increases, as log lambda increases. The highest mean accuracy occurs at a log lambda of about -9.78.
ggplot(perf, aes(log(lambda), mean_acc)) +
geom_point() + geom_line() +
geom_errorbar(aes(ymin=mean_acc-se_acc, ymax=mean_acc + se_acc),
color="red", alpha=1)From the manual cross-validation testing of accuracy the PLR lasso method with lambda of 5.630136e-05 and a threshold of 0.2 produced the greatest mean accuracy: 0.996.
out_confusion = tibble()
#-- Loop over K folds
for(k in unique(folds)){
#- Get train/test split for fold k
ind.train = which(folds != k)
ind.test = which(folds == k)
#- fit the lasso
cv.glmnet.model = glmnet(x.haitiFull.train[ind.train,], y.haitiFull.train[ind.train],
alpha = 1,
family="binomial", # logistic regression
lambda = lam_seq)
#- make predictions on test set fold k
  pred = predict(cv.glmnet.model, x.haitiFull.train[ind.test, ], s=0.00005630136, type = "response")
#- evaluate fold k
y_true = y.haitiFull.train[ind.test]
# set the threshold to p
thres = 0.2
y_hat = pred > thres
#- output
out_confusion = bind_rows(out_confusion,
tibble(truth = y_true,
glmnet.fitted = y_hat))
}
The coefficients at the selected lambda:
coef(cv.glmnet.model, 0.00005630136)
#> 8 x 1 sparse Matrix of class "dgCMatrix"
#> 1
#> (Intercept) -9.9118040564
#> Red -0.1218057899
#> Green -0.0651983875
#> Blue 0.3230103224
#> GBSqr .
#> RBSqr .
#> RGSqr -0.0001042983
#> RGBSqr .
The resulting model is interesting. Cross-validated lasso with p = 0.2 set the coefficients for GBSqr, RBSqr, and RGBSqr to 0. This could be due to collinearity with Blue, or because those variables truly do not contribute to the accuracy of the model. Lasso did, however, retain the new variable RGSqr.
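As a quick check of the collinearity hypothesis, here is a minimal sketch (using the haitiBinaryFull data frame from above) that looks at the correlation matrix of the predictors:
# pairwise correlations among the PLR predictors; values near 1 indicate the
# redundancy that lets lasso drop some of the squared terms
haitiBinaryFull %>%
  dplyr::select(Red, Green, Blue, GBSqr, RBSqr, RGSqr, RGBSqr) %>%
  mutate(across(everything(), as.numeric)) %>%
  cor() %>%
  round(2)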
out_confusion %>%
mutate(truth = factor(truth), glmnet.fitted = factor(glmnet.fitted)) %>%
  conf_mat(truth, glmnet.fitted)
#> Truth
#> Prediction FALSE TRUE
#> FALSE 61045 82
#> TRUE 174 1940
assess.glmnet(pred, newy = y.haitiFull.train[ind.test], family="binomial")
#> $deviance
#> 1
#> 1.367596
#> attr(,"measure")
#> [1] "Binomial Deviance"
#>
#> $class
#> 1
#> 0.9677419
#> attr(,"measure")
#> [1] "Misclassification Error"
#>
#> $auc
#> [1] 0.9996217
#> attr(,"measure")
#> [1] "AUC"
#>
#> $mse
#> 1
#> 0.4912767
#> attr(,"measure")
#> [1] "Mean-Squared Error"
#>
#> $mae
#> 1
#> 0.9881718
#> attr(,"measure")
#> [1] "Mean Absolute Error"
AUC for the lasso PLR model with a lambda of 0.00005630136 is 0.9996.
Using lasso penalized logistic regression with a threshold of 0.2 and a lambda of 0.00005630136 gives the results above. I was curious to examine the performance of the model selected by lasso in an ordinary logistic regression with the threshold also selected by lasso (0.2); this is the selected PLR model without the lambda penalty term.
set.seed(1976)
#Randomly shuffle the data
haitiBinarySqrsFullShuffled = haitiBinaryFull[sample(nrow(haitiBinaryFull)),]
#Create 10 equally size folds
folds <- cut(seq(1,nrow(haitiBinarySqrsFullShuffled)),breaks=10,labels=FALSE)
out_lr.full = tibble()
#Perform 10 fold cross validation
for(i in 1:10) {
#Segment your data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiBinarySqrsFullShuffled[testIndexes, ]
trainData = haitiBinarySqrsFullShuffled[-testIndexes, ]
# define complex model
glm.fits.full = glm(ClassBinary ~ Blue+Green+Red+RGSqr, binomial, data = trainData)
# fit the glmnet lasso model
glm.pred.full = glm.fits.full %>% augment(newdata=testData) %>%
dplyr::select(ClassBinary, .fitted) %>%
mutate(ClassBinary=factor(ClassBinary))%>%
mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < .2, "No", "Yes")) %>%
mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes")))
out_lr.full = bind_rows(out_lr.full,
tibble(truth = testData$ClassBinary,
prediction = glm.pred.full$.prediction,
fitted = glm.pred.full$.fitted))
}
caret::confusionMatrix(out_lr.full$prediction, out_lr.full$truth)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 61039 73
#> Yes 180 1949
#>
#> Accuracy : 0.996
#> 95% CI : (0.9955, 0.9965)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.937
#>
#> Mcnemar's Test P-Value : 2.662e-11
#>
#> Sensitivity : 0.9971
#> Specificity : 0.9639
#> Pos Pred Value : 0.9988
#> Neg Pred Value : 0.9155
#> Prevalence : 0.9680
#> Detection Rate : 0.9652
#> Detection Prevalence : 0.9663
#> Balanced Accuracy : 0.9805
#>
#> 'Positive' Class : No
#>
##produce the numbers associated with classification table
rates.full = ROCR::prediction(out_lr.full$fitted, out_lr.full$truth)
##store the true positive and false positive rates
roc_result.full = performance(rates.full, measure="tpr", x.measure="fpr")
##plot ROC curve and overlay the diagonal line for random guessing
plot(roc_result.full, main="ROC Curve for Haiti Logistic Regression Using Lasso Model")
lines(x = c(0,1), y = c(0,1), col="red")
##compute the AUC
auc = performance(rates.full, measure = "auc")
auc@y.values[[1]]
#> [1] 0.9995394
Using logistic regression with a threshold of 0.2 and the model selected by lasso:
It was interesting that this model performed slightly better on Precision and FPR, but not as well on the other measures, compared with the logistic regression model Red + Green + Blue + RBSqr + GBSqr.
haitiBinaryRF = haitiBinarySqrs %>%
mutate(RGSqr = I(((Red + Green)^2)), RGBSqr = I(((Red + Green + Blue)^2)), RG = (Red * Green), RB = (Red * Blue), GB = (Green * Blue))
head(haitiBinaryRF)
#> # A tibble: 6 x 12
#> Class Red Green Blue GBSqr RBSqr ClassBinary RGSqr RGBSqr RG RB GB
#> <chr> <dbl> <dbl> <dbl> <I<d> <I<d> <fct> <I<d> <I<db> <dbl> <dbl> <dbl>
#> 1 Vege~ 64 67 50 13.7 13.0 No 17161 32761 4288 3200 3350
#> 2 Vege~ 64 67 50 13.7 13.0 No 17161 32761 4288 3200 3350
#> 3 Vege~ 64 66 49 13.2 12.8 No 16900 32041 4224 3136 3234
#> 4 Vege~ 75 82 53 18.2 16.4 No 24649 44100 6150 3975 4346
#> 5 Vege~ 74 82 54 18.5 16.4 No 24336 44100 6068 3996 4428
#> 6 Vege~ 72 76 52 16.4 15.4 No 21904 40000 5472 3744 3952
The first random forest tuning parameters I will test are the number of trees (ntree) and mtry. I will test ntree values by cross-validation.
I did run the ntree loop for 250, 500, and 1000 trees. The best performing number of trees was 1000.
I tuned with a list of mtry values of 3, 4, 5, 6, 7, 8, 9, and 10; 4 performed best. In order to lessen the amount of time needed to knit this Rmd, I am not evaluating the following code block.
control = trainControl(method = 'repeatedcv', number = 10, repeats = 3, search = 'grid')
# possible mtry values
tunegrid = expand.grid(.mtry = c(3,4,5,6,7,8,9,10))
modellist = list()
#train with different ntree parameters
for (ntree in c(250,500,1000)){
set.seed(123)
rf.fit = train(ClassBinary~Red+Blue+Green+GBSqr+RBSqr+RGSqr+RGBSqr+RG+RB+GB,
data = haitiBinaryRF,
method = 'rf',
metric = 'Accuracy',
tuneGrid = tunegrid,
trControl = control,
ntree = ntree)
modellist[[toString(ntree)]] = rf.fit
}
# look at the results of the cross validation with different ntree and mtry values
results = resamples(modellist)
summary(results)
rf.fit$bestTune
summary(rf.fit$finalModel$ntree)
as.data.frame(rf.fit$finalModel$importance) %>% arrange(MeanDecreaseGini)
The above code took such a long time to run that I am no longer evaluating it in my Rmd. Here are screenshots of the output to support my further evaluation of the selected random forest model:
RF Tuning Results
RF mtry Results
RF Importance Results
Next, I need to test for the best probability threshold using the tuned forest with the number of predictors considered at each split (mtry) = 4 and the number of trees = 1000. I will use accuracy to select the best p value.
# need to write own cross validation and use randomforest library
library(randomForest)
set.seed(123)
#Randomly shuffle the data
haitiRFShuffled = haitiBinaryRF[sample(nrow(haitiBinaryRF)),]
#Create 10 equally size folds
folds <- cut(seq(1,nrow(haitiRFShuffled)),breaks=10,labels=FALSE)
head(haitiRFShuffled)
#> # A tibble: 6 x 12
#> Class Red Green Blue GBSqr RBSqr ClassBinary RGSqr RGBSqr RG RB
#> <chr> <dbl> <dbl> <dbl> <I<d> <I<d> <fct> <I<db> <I<db> <dbl> <dbl>
#> 1 Roof~ 240 229 205 188. 198. No 219961 454276 54960 49200
#> 2 Vari~ 91 77 66 20.4 24.6 No 28224 54756 7007 6006
#> 3 Vege~ 70 69 51 14.4 14.6 No 19321 36100 4830 3570
#> 4 Vari~ 127 110 96 42.4 49.7 No 56169 110889 13970 12192
#> 5 Soil 255 255 218 224. 224. No 260100 529984 65025 55590
#> 6 Soil 255 233 169 162. 180. No 238144 431649 59415 43095
#> # ... with 1 more variable: GB <dbl>
I did run this loop with thresholds of 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9, and 0.6 performed the best.
For faster compilation, I limited the threshold loop to 0.5, 0.6, and 0.7 to illustrate the process.
# create storage variables for the p value, Accuracy, ROC, Specificity, and Sensitivity
all_out_yhat_ytrue = tibble()
out_yhat_ytrue = tibble()
out_rf = tibble()
#for (j in c(.2,.3,.4,.5,.6,.7,.8,.9))
for (j in c(.5,.6,.7))
{
# reset the fold accumulator tibble
out_yhat_ytrue = tibble()
#Perform 10 fold cross validation
for(i in 1:10) {
#Segment data by fold using the which() function
testIndexes = which(folds==i,arr.ind=TRUE)
testData = haitiRFShuffled[testIndexes, ]
trainingData = haitiRFShuffled[-testIndexes, ]
#small training and test sets
# comment this out to get real values
#testData = testData[1:100,]
#trainingData= head(trainingData, 500)
#end of small training and test sets
# training
rf.haiti = randomForest(ClassBinary~Red+Blue+Green+GBSqr+RBSqr+RGSqr+RGBSqr+RG+RB+GB, data = trainingData, mtry=4, ntree=1000)
# test
rf.preds = predict(rf.haiti, newdata = testData, type="prob")
#- evaluate fold
y_true = testData %>% mutate(ClassBinary = (ClassBinary == "Yes"), Class0or1 = ifelse(ClassBinary == FALSE,0,1))
y_true_val = y_true$ClassBinary
#y_true
# set the threshold to p
y_hat = rf.preds[,2] > j
#y_hat
#- output
out_yhat_ytrue = bind_rows(out_yhat_ytrue,
tibble(thres = j, true = y_true_val, hat = y_hat, truth0or1 = y_true$Class0or1, X0 = rf.preds[,1], X1 = rf.preds[,2]))
all_out_yhat_ytrue = bind_rows(all_out_yhat_ytrue,
tibble(thres = j, true = y_true_val, hat = y_hat))
}
acc = sum(out_yhat_ytrue$hat == out_yhat_ytrue$true) / nrow(out_yhat_ytrue)
out_rf = bind_rows(out_rf,
tibble(accuracy = acc, thres = j))
}
out_rf
#> # A tibble: 3 x 2
#> accuracy thres
#> <dbl> <dbl>
#> 1 0.997 0.5
#> 2 0.997 0.6
#> 3 0.997 0.7
Random forest cross-validation with mtry = 4, ntree = 1000, and a probability threshold of 0.6 produced the best accuracy.
Next, I produce the confusion matrix and ROC curve
# filter only the true and hat values where thres == 0.6
best_rf_df = all_out_yhat_ytrue %>%
mutate(factortrue = factor(true), factorhat = factor(hat)) %>%
filter(thres == 0.6)
head(best_rf_df)
#> # A tibble: 6 x 5
#> thres true hat factortrue factorhat
#> <dbl> <lgl> <lgl> <fct> <fct>
#> 1 0.6 FALSE FALSE FALSE FALSE
#> 2 0.6 FALSE FALSE FALSE FALSE
#> 3 0.6 FALSE FALSE FALSE FALSE
#> 4 0.6 FALSE FALSE FALSE FALSE
#> 5 0.6 FALSE FALSE FALSE FALSE
#> 6 0.6 FALSE FALSE FALSE FALSE
caret::confusionMatrix(best_rf_df$factorhat, best_rf_df$factortrue)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction FALSE TRUE
#> FALSE 61160 136
#> TRUE 59 1886
#>
#> Accuracy : 0.9969
#> 95% CI : (0.9965, 0.9973)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9493
#>
#> Mcnemar's Test P-Value : 5.255e-08
#>
#> Sensitivity : 0.9990
#> Specificity : 0.9327
#> Pos Pred Value : 0.9978
#> Neg Pred Value : 0.9697
#> Prevalence : 0.9680
#> Detection Rate : 0.9671
#> Detection Prevalence : 0.9692
#> Balanced Accuracy : 0.9659
#>
#> 'Positive' Class : FALSE
#>
par(mfrow=c(1,2))
roc_pred = ROCR::prediction(out_yhat_ytrue$X1, out_yhat_ytrue$truth0or1)
roc_perf = ROCR::performance(roc_pred, "tpr", "fpr")
plot(roc_perf, max = "threshold", colorize=TRUE, lwd= 3)
plot(roc_perf, avg= "vertical", spread.estimate="boxplot", lwd=2,col='blue',
show.spread.at= seq(0.1, 0.9, by=0.1),
main= "Accuracy across cutoffs")auc = ROCR::performance(roc_pred, measure = "auc")
auc = auc@y.values[[1]]
auc
#> [1] 0.9942571
# main="How well do the predictions separate the classes?")| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FPR | Precision |
| Log Reg | N/A | 0.9996 | 0.1 | 0.9957 | 0.9802 | 0.0038 | 0.896 |
| LDA | N/A | 0.99 | 0.8 | 0.994 | 0.844 | 0.001 | 0.964 |
| QDA | N/A | 1.0 | 0.85 | 0.995 | 0.906 | 0.002 | 0.97 |
| KNN | k = 27 | 1.0 | 0.5 | 0.997 | 0.998 | 0.0313 | 0.999 |
| Pen Log Reg | alpha = 1, lambda = 5.630136e-05 | 0.9996 | 0.2 | 0.996 | 0.959 | 0.0028 | 0.918 |
| Random Forest | mtry = 4, ntree = 1000 | 0.994 | 0.6 | 0.997 | 0.932 | 0.0009 | 0.9706 |
| SVM |
Training Data Linearly Separates, but Model is More Than Just Blue
Visually, the training data linearly separate very well. Even the 5-class, un-transformed data set was visually separable in the 3-D R-G-B visualization of “Blue Tarps” vs. the other four classes. Collapsing the classes and transforming the variables to decrease the variability of Red and Green further improved the linear separability. All of the models performed with roughly 99% accuracy under cross-validation on the training data. All except the lasso model used the following formula:
\[ ClassBinary = Blue+Green+Red+GBSqr+RBSqr\] where \(GBSqr=(Green + Blue)^2\) and \(RBSqr=(Red + Blue)^2\)
The lasso model selected the formula: \[ClassBinary = Red + Green + Blue + (Red + Green)^2\]
The models with the additive square terms performed better than the simplest model (ClassBinary = Blue), and slightly better than the basic additive model (ClassBinary = Blue + Red + Green).
Even though a pixel with only a blue tarp in it is easily identifiable, pixels partially representing blue tarps require more predictors to perform better than guessing.
I look forward to discovering if the hold-out/validation data set also separates as well.
Distribution of Classes and How It Affects Model Selection
Only 3% of the observations in the training data are labeled “Blue Tarp”. This is a very unbalanced training set. This is not unexpected because it would be extremely surprising if a high percentage of land area was covered by blue tarps. In fact, if that were the case, then I would expect it would not be challenging for aid workers to find survivors because survivors would be covering a high percentage of Haiti, or Haiti would just have blue tarps everywhere and they would be of little predictive power for finding survivors.
99% accuracy, while impressive sounding, needs to be considered in context. Accuracy would be 97% if the model predicted “No” for every pixel; however, no blue tarps would be identified and survivors would remain undiscovered by aid workers. That “best guess” scenario, despite its high accuracy, provides no value for the humans in our toy problem. 99% accuracy is better than guessing, and the closer to 100% accuracy we can get, the better.
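A quick check of that baseline, as a minimal sketch using the haitiBinarySqrs data frame from earlier:
# accuracy of the trivial "predict No for every pixel" classifier
mean(haitiBinarySqrs$ClassBinary == "No")   # 61219 / 63241, roughly 0.968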
Before selecting any model, I recommend working with experienced aid workers and local Haitian residents with knowledge of the land to determine the choice between, and balance of, these considerations. However, in this case, if we want the model that is least likely to send aid workers to non-blue tarps, we would select the tuned Logistic Regression Model with Sqr Terms; and if we want the model that correctly identified the greatest percentage of blue tarps, we would also select the tuned Logistic Regression Model with Sqr Terms.
The Logistic Regression Model with Sqr Terms, while lower in Precision, performed better on the other measures, which are more appropriate for this problem.
KNN took a long time to tune on my laptop: Of all the models, tuning KNN for both the highest-accuracy number of neighbors and the highest-accuracy probability threshold took the most time for my laptop to complete. In fact, I started the threshold tuning loop and went and did chores while periodically checking for completion. I am assuming that in a real-world scenario the model would be validated and tuned each night with additional labeled data to improve performance. Adding to this concern, I only tested neighbors in the range 1 to 51; with 60,000+ observations a larger k may have performed better, but the time needed to test additional k values was too great.
If time is of the essence, which in this scenario it would be, because getting predictions for survivor locations is only the first step in planning how volunteers will reach them, it may make sense to select a model that does not take as long to produce a result. The KNN model performed almost as well as the logistic regression with the complex model on Negative Predictive Value and False Negative Rate, although the logistic regression model was still better on those two metrics. The KNN model performed better on AUROC and Precision than the logistic regression model, but with a significant increase in compute time.
Very small penalty for additional parameters:
The penalty (\(\lambda\)) that performed best using elasticnet penalized regression was very small. The conclusion I draw is that while lasso performed best on accuracy when comparing ridge, elasticnet, and lasso, the penalty applied to the additional predictors was very small; with such a small ratio of parameters to observations, shrinking the model was not especially useful in this case.
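As a cross-check of this conclusion, here is a minimal sketch (assuming x.haitiFull.train and y.haitiFull.train from the PLR section) that uses glmnet's built-in cv.glmnet to see how small the selected penalty is:
set.seed(1976)
# 10-fold cross-validation over the lasso lambda path, scored by misclassification error
cvfit = cv.glmnet(x.haitiFull.train, y.haitiFull.train,
                  family = "binomial", alpha = 1, type.measure = "class")
cvfit$lambda.min   # lambda minimizing CV misclassification error
plot(cvfit)        # CV error across the lambda path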
Request For Additional Information for Real-World Implementation
I am interested in how to use limited resources to reach the greatest number of vulnerable people. Without GIS information for each pixel, I am unable to calculate the locations where the most blue tarps will be found in order to help the most people; moreover, there may not be a 1:1 relationship between the number of tarps and the number of people, so resource allocation becomes even trickier. My conclusion is that without added GIS information for the observations, I cannot provide information to support an efficient distribution of aid to those affected by the earthquake. Demographic information regarding population density and members per household would be of priceless value, and GIS information for pre-earthquake roads and other geographic features would also be useful.